Capstone Project - The Battle of the Neighborhoods

IBM & Coursera Applied Data Science Capstone

Abstract : Problem Statement

The market for Indian food in San Jose, California is significant and growing. Drawn by opportunities in IT and other technology careers, around ten percent of San Jose's residents are Indian. Indian food also appeals to people of any ethnicity. Demand is therefore likely to keep growing, and, beyond the profit motive, there is an unparalleled opportunity to promote authentic Indian cuisine in the Bay Area.

Given this opportunity, there are many possible locations in the Bay Area where an entrepreneur could start an Indian restaurant, including good ones in San Francisco, Oakland, and San Jose. For the purposes of this Data Science initiative, however, we restrict ourselves to potential venues in San Jose. This is a sensible restriction because opening a business costs less in San Jose than in San Francisco, and San Jose has a higher percentage of Indian residents than Oakland. Ultimately, once we have seen how well our methods work in San Jose, we can use similar techniques to choose locations for future restaurants in other cities.

We will optimize our locations according to three criteria. First, we want few restaurants of any kind nearby, so that our new business has geographic prominence. Second, we pay special attention to areas with no Indian restaurants in the proximity of the selected location. Finally, once these two conditions are met, we want locations as close as possible to downtown San Jose, an area with a higher concentration of tourists. More people are also likely to discover our restaurant spontaneously in downtown San Jose while exploring or out for a night of fun.

We will then use the relevant Python libraries to identify the desirable areas and remark on their relative advantages and disadvantages. This way, potential entrepreneurs can draw on a computational as well as a human perspective (for example, knowledge of a good up-and-coming neighbourhood) when deciding whether to invest a large sum of money in a risky venture.

Data

Considering the problem definition, let's identify the relevant features that will influence our estimate of a prospective venue's desirability:

  • How many existing restaurants are in the neighbourhood (any type)
  • Quantity of Indian restaurants in the neighbourhood
  • Distance of Indian restaurants from our prospective venue
  • Distance between venue address and downtown San Jose

We construct a lattice centered on Downtown San Jose. The points on the lattice represent the potential neighbourhoods. The relevant information comes from the following sources:

  • Automatically generate the centers of the potential venue areas and use the Google Maps Geocoder API to find the addresses of those areas
  • Use the Foursquare API to find the number of restaurants, and their type and location, in every neighbourhood
  • Use the Google Maps Geocoder API to find the coordinates of Downtown San Jose (this notebook uses City Hall, 200 E Santa Clara St, as the reference point)

Neighbourhood Candidates

We will create latitude and longitude coordinates for the centroids of the potential neighbourhoods for our restaurant venues. The lattice is built inside a 12 km by 12 km square (approximately 55 square miles) centered on Downtown San Jose and then trimmed to a 6 km radius. Let's use the Google Maps API to find the coordinates of downtown San Jose.

In [49]:
#Removable Code with API credentials

google_api_key = ''
address = '200 E Santa Clara St, San Jose, CA 95113' #city hall
In [2]:
import requests

#function to extract latitude and longitude from results json
def get_coordinates(api_key, address, verbose=False):
    try:
        url = 'https://maps.googleapis.com/maps/api/geocode/json?key={}&address={}'.format(api_key, address)
        response = requests.get(url).json()
        if verbose:
            print('Google Maps API JSON result =>', response)
        results = response['results']
        geographical_data = results[0]['geometry']['location'] # get geographical coordinates
        lat = geographical_data['lat']
        lon = geographical_data['lng']
        return [lat, lon]
    except Exception: # request failed or no geocoding result
        return [None, None]
In [3]:
cityHall_sj = get_coordinates(google_api_key, address) #cityHall coordinates
In [4]:
# should return coordinates of city hall [37.3380937, -121.8853892]
print(cityHall_sj)
[37.3380937, -121.8853892]
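
The `get_coordinates` helper above interpolates the address directly into the request URL. As a minimal sketch, assuming you would rather percent-encode the query parameters explicitly, the URL could be built with the standard library as follows. The function name `build_geocode_url` and the placeholder key are hypothetical and not part of the original notebook.

```python
from urllib.parse import urlencode

GEOCODE_ENDPOINT = 'https://maps.googleapis.com/maps/api/geocode/json'

def build_geocode_url(api_key, address):
    """Return the geocoding request URL with percent-encoded query parameters."""
    # urlencode turns spaces into '+' and commas into '%2C'
    query = urlencode({'key': api_key, 'address': address})
    return '{}?{}'.format(GEOCODE_ENDPOINT, query)

# 'YOUR_API_KEY' is a placeholder, not a real credential
url = build_geocode_url('YOUR_API_KEY', '200 E Santa Clara St, San Jose, CA 95113')
print(url)
```

The resulting string can be passed to `requests.get` in the same way as the hand-formatted URL.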

Having found the coordinates of City Hall, we now have a good center point for our lattice. The candidate areas all lie within approximately 4 miles of City Hall, giving a good spread across San Jose. Adjacent points of the lattice, which represent our candidate neighbourhoods, are 600 meters apart, so each "neighbourhood" comprises everything within a 300-meter radius of its center.

Since we need to calculate the distance from City Hall as part of our analysis, it is necessary to shift from spherical coordinates to Cartesian coordinates (2D). This allows us to easily measure and tabulate all the distances in meters. Subsequently, we can also convert back from Cartesian to spherical coordinates (latitude/longitude) for our Folium map. We will use the pyproj and shapely libraries to write our helper functions for coordinate transformation.
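
The helper functions below pass a hard-coded UTM zone to pyproj. As a sanity check on that choice, the standard 6-degree UTM zone for a longitude can be computed with a few lines of arithmetic. This is only a sketch: the names `utm_zone` and `utm_central_meridian` are hypothetical, and the special-case zones around Norway and Svalbard are ignored. For San Jose it yields zone 10, whereas the helpers use `zone=33`. A transverse Mercator projection becomes increasingly distorted away from its central meridian, so distances computed in zone 33 should be treated as approximate.

```python
import math

def utm_zone(longitude):
    """Standard 6-degree UTM zone number (1-60) for a longitude in degrees."""
    # Ignores the irregular zones around Norway and Svalbard
    return int(math.floor((longitude + 180.0) / 6.0)) % 60 + 1

def utm_central_meridian(zone):
    """Central meridian of a UTM zone, in degrees east."""
    return zone * 6 - 183

san_jose_lon = -121.8853892  # City Hall longitude from the geocoding result above
zone = utm_zone(san_jose_lon)
print('UTM zone:', zone, 'central meridian:', utm_central_meridian(zone))
print('Zone 33 central meridian:', utm_central_meridian(33))
```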

In [5]:
import shapely.geometry
import pyproj
import math

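#define utilities for coordinate transformations and finding cartesian distance

def s2c(lon,lat): #spherical to cartesian transform (lonlat_to_xy)
    proj_sphr = pyproj.Proj(proj='latlong', datum='WGS84') #spherical projection
    # NOTE: San Jose lies in UTM zone 10; zone=33 is what produced the outputs recorded below,
    # so the X/Y values are distorted and distances computed from them are only approximate
    proj_cart = pyproj.Proj(proj="utm", zone=33, datum='WGS84') #cartesian projection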
    cart = pyproj.transform(proj_sphr, proj_cart, lon, lat)
    return cart[0], cart[1]

def c2s(x, y): #cartesian to spherical transform (xy_to_lonlat)
    proj_sphr = pyproj.Proj(proj='latlong', datum='WGS84') #spherical projection
    proj_cart = pyproj.Proj(proj="utm", zone=33, datum='WGS84') #cartesian projection
    sphr = pyproj.transform(proj_cart, proj_sphr, x, y)
    return sphr[0], sphr[1]
  
def calc_xy_distance(x1, y1, x2, y2): #returns distance between two points
    dx = x2 - x1
    dy = y2 - y1
    return math.sqrt(dx*dx + dy*dy)
    
In [6]:
# If the functions work as expected, the round-trip transformation returns
# lon, lat values matching cityHall_sj[1] and cityHall_sj[0]

print('Coordinate transformation check')
print('-------------------------------')
print('San Jose City Hall longitude={}, latitude={}'.format(cityHall_sj[1], cityHall_sj[0]))
x, y = s2c(cityHall_sj[1],cityHall_sj[0]) #test function on city hall coordinates
print('San Jose City Hall UTM X={}, Y={}'.format(x, y))
lon, lat = c2s(x, y) #convert back to spherical coordinates
print('San Jose City Hall longitude={}, latitude={}'.format(lon, lat))

#success!!!...?
Coordinate transformation check
-------------------------------
San Jose City Hall longitude=-121.8853892, latitude=37.3380937
San Jose City Hall UTM X=-3387770.768516667, Y=14868397.981863914
San Jose City Hall longitude=-121.8853892, latitude=37.33809370000003

Wunderbar! The coordinate transformation round trip recovers the original coordinates. Now we can construct a hexagonal lattice. To do so, we offset every other row and adjust the vertical row spacing so that each vertex is equidistant from all its neighbours.

In [7]:
cityHall_sj_x, cityHall_sj_y = s2c(cityHall_sj[1], cityHall_sj[0]) # Cartesian

k = math.sqrt(3) / 2 # Vertical offset for hexagonal grid cells
x_min = cityHall_sj_x - 6000
x_step = 600
y_min = cityHall_sj_y - 6000 - (int(21/k)*k*600 - 12000)/2
y_step = 600 * k

latitudes = []
longitudes = []
distances_from_center = []
xs = []
ys = []

for i in range(0, int(21/k)):
    y = y_min + i * y_step
    x_offset = 300 if i%2==0 else 0
    for j in range(0, 21):
        x = x_min + j * x_step + x_offset
        distance_from_center = calc_xy_distance(cityHall_sj_x, cityHall_sj_y, x, y)
        if (distance_from_center <= 6001):
            lon, lat = c2s(x, y)
            latitudes.append(lat)
            longitudes.append(lon)
            distances_from_center.append(distance_from_center)
            xs.append(x)
            ys.append(y)
        
print(len(latitudes), 'candidate neighbourhood centers generated')
364 candidate neighbourhood centers generated
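
The claim that offsetting alternate rows and scaling the row spacing by sqrt(3)/2 makes neighbours equidistant can be checked on a small lattice. The following is a minimal sketch using only the standard library; `hex_lattice` and `nearest_neighbour_distance` are hypothetical helper names, and the 5 by 5 grid is just a toy example in the same projected units as the notebook.

```python
import math

def hex_lattice(rows, cols, spacing):
    """Return (x, y) vertices of a hexagonal lattice, offsetting every other row by half a step."""
    k = math.sqrt(3) / 2  # ratio of row spacing to column spacing
    points = []
    for i in range(rows):
        x_offset = spacing / 2 if i % 2 == 0 else 0.0
        for j in range(cols):
            points.append((j * spacing + x_offset, i * spacing * k))
    return points

def nearest_neighbour_distance(p, points):
    """Distance from p to the closest other point in the lattice."""
    return min(math.hypot(p[0] - q[0], p[1] - q[1]) for q in points if q != p)

points = hex_lattice(rows=5, cols=5, spacing=600)
nearest = [nearest_neighbour_distance(p, points) for p in points]
print('nearest-neighbour distances range from', min(nearest), 'to', max(nearest))
```

Both the horizontal neighbours and the diagonal neighbours in adjacent rows come out at the lattice spacing, up to floating-point rounding.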

Folium will help us visualize what we have so far: the City Hall coordinates and the neighbourhood centroids.

In [8]:
import folium
In [9]:
map_sj = folium.Map(location=cityHall_sj, zoom_start=13)
folium.Marker(cityHall_sj, popup='Downtown SJ').add_to(map_sj)
for lat, lon in zip(latitudes, longitudes):
    folium.Circle([lat, lon], radius=250, color='blue', fill=False).add_to(map_sj)
map_sj
Out[9]:
[Interactive Folium map: City Hall marker and blue circles for the candidate neighbourhood centers]

Each centroid is the same distance from its nearest neighbours, and all of them lie within a 6 km (approximately 3.75 mile) radius of City Hall.
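
The unit conversions behind these figures can be reproduced with a short worked calculation. This is a sketch under the assumption that the lattice is treated as an ideal 6 km circle inscribed in a 12 km square; the variable names are illustrative only.

```python
import math

KM_PER_MILE = 1.609344  # international mile

radius_km = 6.0
radius_miles = radius_km / KM_PER_MILE

# Area of the 6 km circle that the lattice is trimmed to
circle_area_sq_miles = math.pi * radius_miles ** 2
# Area of the 12 km x 12 km bounding square the lattice is generated in
square_area_sq_miles = (2 * radius_miles) ** 2

print('radius: {:.2f} miles'.format(radius_miles))
print('circle area: {:.1f} square miles'.format(circle_area_sq_miles))
print('bounding square area: {:.1f} square miles'.format(square_area_sq_miles))
```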

Having generated coordinate pairs for the vertices of our hexagonal lattice, we proceed to use the Google Maps Geocoder API to get addresses for each of these vertices.

In [10]:
# function to find approximate address of a neighbourhood given latitude and longitude

def get_address(api_key, latitude, longitude, verbose=False):
    try:
        url = 'https://maps.googleapis.com/maps/api/geocode/json?key={}&latlng={},{}'.format(api_key, latitude,longitude)
        response = requests.get(url).json()
        if verbose:
            print('Google Maps API JSON results =>', response)
        results = response['results']
        address = results[0]['formatted_address']
        return address
    except Exception: # request failed or no reverse-geocoding result
        return None
In [11]:
#test reverse geocoding on city hall coordinates

addr = get_address(google_api_key,cityHall_sj[0],cityHall_sj[1])
print('Reverse geocoding test')
print('______________________')
print('Address of [{},{}] is: {}'.format(cityHall_sj[0], cityHall_sj[1], addr))
      
Reverse geocoding test
______________________
Address of [37.3380937,-121.8853892] is: 200 E Santa Clara St, San Jose, CA 95113, USA
In [12]:
# use get_address() to find addresses for our neighbourhoods

print('Obtaining location addresses: ', end='')
addresses = []
for lat, lon in zip(latitudes, longitudes):
    address = get_address(google_api_key, lat, lon)
    if address is None:
        address = 'NO ADDRESS'
    address = address.replace(', USA', '') # country name is unnecessary
    addresses.append(address)
    print(' .', end='')
print(' done.')
Obtaining location addresses:  . . . [progress dots truncated] . . . done.
In [13]:
addresses[85:100]
Out[13]:
['1313 E Julian St, San Jose, CA 95116',
 '5 Eggo Way, San Jose, CA 95116',
 '623 Monferino Dr, San Jose, CA 95112',
 '790 N 19th St, San Jose, CA 95112',
 '987 N 17th St, San Jose, CA 95112',
 '664 Commercial St, San Jose, CA 95112',
 '1336 Old Bayshore Hwy, San Jose, CA 95112',
 '1540 Old Bayshore Hwy, San Jose, CA 95112',
 '436 Reynolds Cir, San Jose, CA 95112',
 '1848-1866 Bering Dr, San Jose, CA 95112',
 '55 E Brokaw Rd, San Jose, CA 95112',
 '1751 Bermuda Way, San Jose, CA 95122',
 '1463 S King Rd, San Jose, CA 95122',
 '1641 Marsh St, San Jose, CA 95122',
 '647 S King Rd, San Jose, CA 95116']

Alright. These look like good example addresses in San Jose. We can move on and process these data by organizing them within a Pandas dataframe.

In [14]:
import pandas as pd

#locations dataframe
df_loc = pd.DataFrame({'Address': addresses,
                      'Latitude': latitudes,
                      'Longitude': longitudes,
                      'X': xs,
                      'Y': ys,
                      'Distance': distances_from_center}) #distance from City Hall

df_loc.head(10)
Out[14]:
   Address                                          Latitude    Longitude              X             Y     Distance
0  259 N Capitol Ave UNIT 275, San Jose, CA 95127  37.368921  -121.843773  -3.389571e+06  1.486268e+07  5992.495307
1  429 Giannotta Way, San Jose, CA 95133           37.371168  -121.848712  -3.388971e+06  1.486268e+07  5840.376700
2  675 Ranson Dr, San Jose, CA 95133               37.373415  -121.853652  -3.388371e+06  1.486268e+07  5747.173218
3  798 Ferndale Ct, San Jose, CA 95133             37.375662  -121.858592  -3.387771e+06  1.486268e+07  5715.767665
4  969 Cadet Pl, San Jose, CA 95133                37.377909  -121.863533  -3.387171e+06  1.486268e+07  5747.173218
5  1147 Ribisi Cir, San Jose, CA 95131             37.380156  -121.868474  -3.386571e+06  1.486268e+07  5840.376700
6  1885 Seville Way, San Jose, CA 95131            37.382402  -121.873416  -3.385971e+06  1.486268e+07  5992.495307
7  24 Alexander Ave, San Jose, CA 95116            37.362136  -121.838804  -3.390471e+06  1.486320e+07  5855.766389
8  195 Gramercy Pl, San Jose, CA 95116             37.364383  -121.843742  -3.389871e+06  1.486320e+07  5604.462508
9  2343 McKee Rd, San Jose, CA 95116               37.366630  -121.848681  -3.389271e+06  1.486320e+07  5408.326913
In [15]:
df_loc.to_pickle('./indian_locations.pkl')

Foursquare

Having extracted and tabulated the addresses of the neighbourhoods around City Hall, we can use the Foursquare API to learn about the venues within each neighbourhood.

We will extract the category IDs corresponding to Indian restaurants from the Foursquare website. Note that the underlying broad category is "food", which has its own ID. We also scan each venue's category names for words such as "restaurant". Combined with the category IDs for Indian restaurants, this lets us count both the restaurants and the Indian restaurants in each neighbourhood, which in turn lets us assess the prospective value of each location.

In [16]:
# imports repeated so the notebook can resume from this cell after a kernel restart
import pandas as pd
import pickle
import requests
from bs4 import BeautifulSoup
import re
In [17]:
#try to load from local system if we tried this before

df_loc = pd.read_pickle("./indian_locations.pkl")
df_loc.head()
Out[17]:
   Address                                          Latitude    Longitude              X             Y     Distance
0  259 N Capitol Ave UNIT 275, San Jose, CA 95127  37.368921  -121.843773  -3.389571e+06  1.486268e+07  5992.495307
1  429 Giannotta Way, San Jose, CA 95133           37.371168  -121.848712  -3.388971e+06  1.486268e+07  5840.376700
2  675 Ranson Dr, San Jose, CA 95133               37.373415  -121.853652  -3.388371e+06  1.486268e+07  5747.173218
3  798 Ferndale Ct, San Jose, CA 95133             37.375662  -121.858592  -3.387771e+06  1.486268e+07  5715.767665
4  969 Cadet Pl, San Jose, CA 95133                37.377909  -121.863533  -3.387171e+06  1.486268e+07  5747.173218
In [18]:
# the columns have been extracted as lists for later use

addresses = list(df_loc['Address'])
latitudes = list(df_loc['Latitude'])
longitudes = list(df_loc['Longitude'])
xs = list(df_loc['X'])
ys = list(df_loc['Y'])
distances_from_center = list(df_loc['Distance'])

Great! Having restored our data from pickle, we keep the Foursquare credentials in a separate cell below. Then we can pull and arrange the category IDs and define helper functions to manipulate the addresses and access the API.

In [48]:
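# Foursquare client credentials (removed before publishing; fill in your own)
foursquare_client_id = ''
foursquare_client_secret = ''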
In [20]:
# We can use BeautifulSoup to automagically extract IDs, avoiding tedious copypasta

URL = 'https://developer.foursquare.com/docs/build-with-foursquare/categories/'
content = requests.get(URL)
soup = BeautifulSoup(content.text, 'html.parser')

Foursquare's organized developer site has been transformed into an even more organized soup object. We can query this object according to the appropriate tags to easily find the part of the source containing the restaurant names and IDs.

In [21]:
row = soup.find_all('li')
split_index = []

txt = [row[i].get_text() for i in range(317,344)] #information for indian cuisine

for p in range(len(txt)): #nested loops to find where venue name ends and id begins
    for q in range(len(txt[p])):
        if txt[p][q].isnumeric():
            split_index.append(q)
            break

print(split_index)

txtVenue = [e[0][:e[1]] for e in zip(txt, split_index)] 
print(txtVenue)
idVenue = [e[0][e[1]:e[1]+24] for e in zip(txt, split_index)]

print(idVenue)
[17, 17, 17, 18, 11, 20, 5, 10, 15, 19, 21, 25, 17, 10, 15, 20, 17, 24, 18, 30, 23, 27, 16, 18, 21, 23, 16]
['Indian Restaurant', 'Andhra Restaurant', 'Awadhi Restaurant', 'Bengali Restaurant', 'Chaat Place', 'Chettinad Restaurant', 'Dhaba', 'Dosa Place', 'Goan Restaurant', 'Gujarati Restaurant', 'Hyderabadi Restaurant', 'Indian Chinese Restaurant', 'Indian Sweet Shop', 'Irani Cafe', 'Jain Restaurant', 'Karnataka Restaurant', 'Kerala Restaurant', 'Maharashtrian Restaurant', 'Mughlai Restaurant', 'Multicuisine Indian Restaurant', 'North Indian Restaurant', 'Northeast Indian Restaurant', 'Parsi Restaurant', 'Punjabi Restaurant', 'Rajasthani Restaurant', 'South Indian Restaurant', 'Udupi Restaurant']
['4bf58dd8d48988d10f941735', '54135bf5e4b08f3d2429dfe5', '54135bf5e4b08f3d2429dff3', '54135bf5e4b08f3d2429dff5', '54135bf5e4b08f3d2429dfe2', '54135bf5e4b08f3d2429dff2', '54135bf5e4b08f3d2429dfe1', '54135bf5e4b08f3d2429dfe3', '54135bf5e4b08f3d2429dfe8', '54135bf5e4b08f3d2429dfe9', '54135bf5e4b08f3d2429dfe6', '54135bf5e4b08f3d2429dfdf', '54135bf5e4b08f3d2429dfe4', '54135bf5e4b08f3d2429dfe7', '54135bf5e4b08f3d2429dfea', '54135bf5e4b08f3d2429dfeb', '54135bf5e4b08f3d2429dfed', '54135bf5e4b08f3d2429dfee', '54135bf5e4b08f3d2429dff4', '54135bf5e4b08f3d2429dfe0', '54135bf5e4b08f3d2429dfdd', '54135bf5e4b08f3d2429dff6', '54135bf5e4b08f3d2429dfef', '54135bf5e4b08f3d2429dff0', '54135bf5e4b08f3d2429dff1', '54135bf5e4b08f3d2429dfde', '54135bf5e4b08f3d2429dfec']
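
The nested loops above split each row at its first numeric character. A regular expression can express the same rule more compactly. This is a minimal sketch, assuming each row's text is a digit-free name immediately followed by a 24-character hexadecimal ID; the sample rows are reconstructed from the printed names and IDs rather than taken from the live page, and `split_name_and_id` is a hypothetical helper.

```python
import re

# Digit-free name, then a 24-character hexadecimal ID that starts with a digit
ROW_PATTERN = re.compile(r'(\D+)(\d[0-9a-f]{23})')

def split_name_and_id(text):
    """Return (name, category_id), or None if the text does not match the assumed layout."""
    match = ROW_PATTERN.match(text)
    if match is None:
        return None
    return match.group(1), match.group(2)

# Sample rows rebuilt from the output above (assumed layout: name directly followed by ID)
sample_rows = [
    'Indian Restaurant4bf58dd8d48988d10f941735',
    'Dhaba54135bf5e4b08f3d2429dfe1',
    'Irani Cafe54135bf5e4b08f3d2429dfe7',
]

pairs = [split_name_and_id(row) for row in sample_rows]
for name, category_id in pairs:
    print(name, '->', category_id)
```

Anchoring the ID on a leading digit matters for names such as "Irani Cafe", whose trailing letters are themselves valid hexadecimal characters.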
In [22]:
# These IDs have been pulled from the Foursquare site <https://developer.foursquare.com>

food_category = '4d4b7105d754a06374d81259' # 'Root' category for all food-related venues
indian_restaurant_categories = idVenue #pulled from Foursquare developer page

def is_restaurant(categories, specific_filter=None):
    restaurant_words = ['restaurant', 'place', 'chaat', 'tandoori']
    restaurant = False
    specific = False
    for c in categories:
        category_name = c[0].lower()
        category_id = c[1]
        for r in restaurant_words:
            if r in category_name:
                restaurant = True
        if 'fast food' in category_name:
            restaurant = False
        if specific_filter is not None and category_id in specific_filter:
            specific = True
            restaurant = True
    return restaurant, specific

def get_categories(categories):
    return [(cat['name'], cat['id']) for cat in categories]

def format_address(location):
    address = ', '.join(location['formattedAddress'])
    address = address.replace(', USA', '')
    address = address.replace(', US', '')
    return address

def get_venues_near_location(lat, lon, category, client_id, client_secret, radius=500, limit=100):
    version = '20180724'
    url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&categoryId={}&radius={}&limit={}'.format(
        client_id, client_secret, version, lat, lon, category, radius, limit)
    try:
        results = requests.get(url).json()['response']['groups'][0]['items']
        venues = [(item['venue']['id'],
                   item['venue']['name'],
                   get_categories(item['venue']['categories']),
                   (item['venue']['location']['lat'], item['venue']['location']['lng']),
                   format_address(item['venue']['location']),
                   item['venue']['location']['distance']) for item in results]        
    except Exception: # request failed or unexpected response shape
        venues = []
    return venues

# function to extract proximal venues given neighbourhood coordinates as input
# also create an associative array of all found restaurants and all found Indian restaurants

def get_restaurants(lats, lons):
    restaurants = {}
    indian_restaurants = {}
    location_restaurants = []

    print('Obtaining venues around candidate locations:', end='')
    for lat, lon in zip(lats, lons):
        # Using radius=325 to make sure we have complete coverage & don't miss any restaurant (we're using dictionaries to remove any duplicates resulting from area overlaps)
        venues = get_venues_near_location(lat, lon, food_category, foursquare_client_id, foursquare_client_secret, radius=325, limit=100)
        area_restaurants = []
        for venue in venues:
            venue_id = venue[0]
            venue_name = venue[1]
            venue_categories = venue[2]
            venue_latlon = venue[3]
            venue_address = venue[4]
            venue_distance = venue[5]
            is_res, is_indian = is_restaurant(venue_categories, specific_filter=indian_restaurant_categories)
            if is_res:
                x, y = s2c(venue_latlon[1], venue_latlon[0])
                restaurant = (venue_id, venue_name, venue_latlon[0], venue_latlon[1], venue_address, venue_distance, is_indian, x, y)
                if venue_distance<=300:
                    area_restaurants.append(restaurant)
                restaurants[venue_id] = restaurant
                if is_indian:
                    indian_restaurants[venue_id] = restaurant
        location_restaurants.append(area_restaurants)
        print(' .', end='')
    print(' done.')
    return restaurants, indian_restaurants, location_restaurants
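
To illustrate how the classification rules in `is_restaurant` behave on typical category lists, here is a simplified, self-contained restatement with a few sample inputs. The function name `classify` and the non-Indian category IDs are hypothetical; only the Indian Restaurant ID comes from the scraped list above.

```python
def classify(categories, indian_ids):
    """Simplified restatement of is_restaurant: returns (is_restaurant, is_indian)."""
    keywords = ('restaurant', 'place', 'chaat', 'tandoori')
    restaurant = False
    indian = False
    for name, category_id in categories:
        lowered = name.lower()
        if any(word in lowered for word in keywords):
            restaurant = True
        if 'fast food' in lowered:
            restaurant = False  # fast food venues are excluded
        if category_id in indian_ids:
            indian = True
            restaurant = True   # an Indian category always counts as a restaurant
    return restaurant, indian

indian_ids = {'4bf58dd8d48988d10f941735'}  # 'Indian Restaurant' from the scraped list

samples = {
    'indian': [('Indian Restaurant', '4bf58dd8d48988d10f941735')],
    'fast_food': [('Fast Food Restaurant', 'hypothetical-id-1')],
    'coffee': [('Coffee Shop', 'hypothetical-id-2')],
    'pizza': [('Pizza Place', 'hypothetical-id-3')],
}

results = {label: classify(cats, indian_ids) for label, cats in samples.items()}
for label, flags in results.items():
    print(label, flags)
```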
In [23]:
# Try to load from local file system in case we did this before
# note: the search radius is now 325 m, not 350 m (hence the _325 file names)

restaurants = {}
indian_restaurants = {}
location_restaurants = []
loaded = False
try: #Indian cuisine
    with open('restaurants_325.pkl', 'rb') as f:
        restaurants = pickle.load(f)
    with open('indian_restaurants_325.pkl', 'rb') as f:
        indian_restaurants = pickle.load(f)
    with open('location_restaurants_325.pkl', 'rb') as f:
        location_restaurants = pickle.load(f)
    print('Restaurant data loaded.')
    loaded = True
except Exception: # no usable cache on disk
    pass

# If load failed use the Foursquare API to get the data
if not loaded:
    restaurants, indian_restaurants, location_restaurants = get_restaurants(latitudes, longitudes)
    
    # Let's persist this in the local file system
    with open('restaurants_325.pkl', 'wb') as f:
        pickle.dump(restaurants, f)
    with open('indian_restaurants_325.pkl', 'wb') as f:
        pickle.dump(indian_restaurants, f)
    with open('location_restaurants_325.pkl', 'wb') as f:
        pickle.dump(location_restaurants, f)
        
Restaurant data loaded.
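
The load-or-fetch pattern in the cell above can be factored into a small reusable helper. The following is a sketch only: `load_or_compute` and `fake_fetch` are hypothetical names, and the demo writes to a temporary directory rather than the notebook's real pickle files.

```python
import os
import pickle
import tempfile

def load_or_compute(path, compute):
    """Return the pickled object at `path`, or call `compute()` and cache its result there."""
    try:
        with open(path, 'rb') as f:
            return pickle.load(f)
    except (FileNotFoundError, pickle.UnpicklingError, EOFError):
        result = compute()
        with open(path, 'wb') as f:
            pickle.dump(result, f)
        return result

calls = []

def fake_fetch():
    """Stand-in for an expensive API call; records how often it runs."""
    calls.append(1)
    return {'venue-1': ('venue-1', 'Example Restaurant')}

with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, 'restaurants_demo.pkl')
    first = load_or_compute(cache_path, fake_fetch)   # computes and writes the cache
    second = load_or_compute(cache_path, fake_fetch)  # reads the cache, no second fetch

print('fetch calls:', len(calls))
```

Catching only the specific exceptions that indicate a missing or unreadable cache keeps other errors, such as a failed API call inside `compute`, visible.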
In [24]:
import numpy as np

print('Total number of restaurants:', len(restaurants))
print('Total number of Indian restaurants:', len(indian_restaurants))
print('Percentage of Indian restaurants: {:.2f}%'.format(len(indian_restaurants) / len(restaurants) * 100))
print('Average number of restaurants in neighborhood:', np.array([len(r) for r in location_restaurants]).mean())
Total number of restaurants: 713
Total number of Indian restaurants: 15
Percentage of Indian restaurants: 2.10%
Average number of restaurants in neighborhood: 2.3846153846153846
In [25]:
print('List of all restaurants')
print('-----------------------')
for r in list(restaurants.values())[:10]:
    print(r)
print('...')
print('Total:', len(restaurants))
List of all restaurants
-----------------------
('52d88e5211d2c9861d3fe824', 'Panda Express', 37.37104697664496, -121.8446722008435, '361 N Capitol Ave (McKee Rd), San Jose, CA 95133, United States', 249, False, -3389349.0978494952, 14862484.929077601)
('52abcb6211d2d07f3e035a5c', 'Chipotle Mexican Grill', 37.3711472221808, -121.84490282047204, '361 N Capitol Ave Ste 30 (361 N. Capitol Ave), San Jose, CA 95133, United States', 267, False, -3389321.3890519617, 14862485.467674306)
('4b9efba2f964a5204d0e37e3', 'Chalateco', 37.369531762051736, -121.84215205710369, '280 N Capitol Ave (at McKee), San Jose, CA 95127, United States', 158, False, -3389679.3902345304, 14862527.2581463)
('4be39090660ec9282f94cc3b', 'Taqueria Los Burritos', 37.369575567764244, -121.84495137158977, '311 N Capitol Ave Ste B, San Jose, CA 95133, United States', 127, False, -3389419.866144735, 14862668.54959784)
('558a22ea498e108469c95b5e', 'El Palacio Del Sabor', 37.36944983554564, -121.84224733951328, '280 N Capitol Ave, San Jose, CA 95127, United States', 147, False, -3389676.0201176517, 14862541.650308106)
('5a62d1a923a2e620157d3cfa', 'Mehak Of India', 37.369503, -121.844941, 'San Jose, CA 95133, United States', 121, True, -3389425.569514903, 14862676.343804732)
('55da37e5498e4af8f0045293', 'Pho Tick Tock', 37.37189963475057, -121.8461766044407, '399 N Capitol Ave, San Jose, CA 95133, United States', 238, False, -3389155.3321247753, 14862465.612668674)
('4ac415d3f964a520259e20e3', 'SJ Crawfish', 37.37183284013543, -121.84617071621577, '393 N Capitol Ave, San Jose, CA 95133, United States', 236, False, -3389160.2462593494, 14862472.977841955)
('5a4edb3b06fb606fd5b46ea6', 'Bon Chon Chicken', 37.371506, -121.846663, '377 N Capitol Ave, San Jose, CA 95133, United States', 185, False, -3389136.5176152964, 14862536.252861021)
('4ba3ff30f964a520d47538e3', 'Birreria Tepa', 37.360823323757266, -121.83723188749055, '2610 Alum Rock Ave (Capitol Ave), San Jose, CA 95116, United States', 201, False, -3390700.9305659463, 14863270.475984525)
...
Total: 713
In [26]:
print('List of Indian restaurants')
print('---------------------------')
for r in list(indian_restaurants.values())[:10]:
    print(r)
print('...')
print('Total:', len(indian_restaurants))
List of Indian restaurants
---------------------------
('5a62d1a923a2e620157d3cfa', 'Mehak Of India', 37.369503, -121.844941, 'San Jose, CA 95133, United States', 121, True, -3389425.569514903, 14862676.343804732)
('4ad3d7def964a52094e620e3', 'Jewel of India', 37.3609355384021, -121.8364296875362, '2634 Alum Rock Ave, San Jose, CA 95116, United States', 248, True, -3390767.145275504, 14863215.641734704)
('57fee7b3498e55c9199c6f18', 'spice n flavor', 37.365674, -121.851624, '301 N Jackson Ave, San Jose, CA 95133, United States', 228, True, -3389063.5330501273, 14863465.489191385)
('59d97cd851950e4b69a685cb', 'Tandoor House', 37.370947, -121.917054, '1751 N. 1st Streeet, San Jose, CA 95112, United States', 55, True, -3382719.3519299435, 14866275.056158753)
('53966f83498ea919cfba5d19', 'Swaad Indian Cusine', 37.35074282782658, -121.88427285255668, '498 N 13th St (Empire), San Jose, CA 95112, United States', 111, True, -3387046.1737660808, 14866886.175847784)
('4c93cdaa6b35a143876d15dc', 'Madhuban Indian Cuisine', 37.366116781099514, -121.91493078690894, '50 Skyport Dr, San Jose, CA 95110, United States', 239, True, -3383229.591319359, 14866719.26309774)
('4bf9db805efe2d7f49f46c34', 'Punjab Cafe', 37.33931691359212, -121.88375001401245, '322 E Santa Clara St, San Jose, CA 95113, United States', 271, True, -3387841.2052061306, 14868171.845057674)
('558f686a498e7fe2f1e02530', 'Chennai Kings Express', 37.3362518229335, -121.87732496220617, '456 E San Carlos Street, San Jose, CA, United States', 179, True, -3388631.173895321, 14868188.646414842)
('4b6342c9f964a5202c6e2ae3', 'Sagar Sweets', 37.34590261115486, -121.90068188602673, '146 George St, San Jose, CA 95110, United States', 177, True, -3385857.2624013685, 14868298.667016208)
('529cec6211d2c840623421e9', 'Curry Pundits', 37.33646494940144, -121.88988520123324, '30 E Santa Clara St, San Jose, CA 95113, United States', 259, True, -3387464.683153764, 14868819.82909556)
...
Total: 15
In [27]:
print('Restaurants around location')
print('---------------------------')
for i in range(100, 110):
    rs = location_restaurants[i][:8]
    names = ', '.join([r[1] for r in rs])
    print('Restaurants around location {}: {}'.format(i+1, names))
Restaurants around location
---------------------------
Restaurants around location 101: 
Restaurants around location 102: 
Restaurants around location 103: Cal Foods Mexican Deli, El Azteca
Restaurants around location 104: Plaza Garibaldi, El Aguila Restaurante y Pupuseria, Linda's Restaurant, Mi Tierra
Restaurants around location 105: El Grullense aka "El Gru", Little Caesars Pizza
Restaurants around location 106: 
Restaurants around location 107: 
Restaurants around location 108: K and C Food Wholesale, Taqueria Lorena's, K&C Seafood, Mariscos Costa Alegre, San Jose Noodle, Gecko grill, Giovanni's Pizza, bienvenidos
Restaurants around location 109: Chez Sovan, Taqueria Lorena's, Happy Smile Deli, bienvenidos, Giovanni's Pizza, Gecko grill, Mariscos Costa Alegre, San Jose Noodle
Restaurants around location 110: Subway
In [28]:
map_sj = folium.Map(location=cityHall_sj, zoom_start=13)
folium.Marker(cityHall_sj, popup='City Hall').add_to(map_sj)
for res in restaurants.values():
    lat = res[2]; lon = res[3]
    is_indian = res[6]
    color = 'red' if is_indian else 'blue'
    folium.CircleMarker([lat, lon], radius=3, color=color, fill=True, fill_color=color, fill_opacity=1).add_to(map_sj)
map_sj
Out[28]:
[Interactive Folium map: restaurants around City Hall, Indian restaurants in red and all others in blue]

This looks good! We have collected all the restaurants in the vicinity of City Hall, and we know which of them are Indian restaurants. We can also see which areas have high and low concentrations of restaurants in general and of Indian restaurants in particular.

Now that we have collected the data, it's key that we analyze it precisely to identify ideal locations for new Indian restaurants.

Methodology

Here, we will focus on finding areas of San Jose that have low density of restaurants, particularly Indian restaurants. We will limit our analysis to 3.75 miles around City Hall.

In the first step, we collected the required data: the location and type of every restaurant in the proximity of City Hall (a sweep of roughly 4 miles in radius centered on downtown San Jose). We also used the relevant Foursquare category codes to identify Indian restaurants.

Now, we are ready to create visualisations of restaurant density. Using Folium and other tools, we will take advantage of the visual appeal of heatmaps and the intuitive power and simplicity of k-means clustering to identify ideal neighbourhoods that would serve as a good starting point for a potential entrepreneur or restaurateur looking to get into the business of serving Indian cuisine.

We will make sure that the areas we are looking at have no more than two restaurants within a radius of 250 meters and zero Indian restaurants within a radius of 400 meters.
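
The two thresholds above translate directly into a filter over candidate areas. As a minimal sketch, assuming each candidate carries a count of restaurants within 250 meters and the distance to its nearest Indian restaurant, the selection could look like this. The field names, the `is_promising` helper, and the four sample candidates are all hypothetical.

```python
# Hypothetical candidate areas; the values are made up for illustration
candidates = [
    {'name': 'A', 'restaurants_250m': 0, 'nearest_indian_m': 1200.0},
    {'name': 'B', 'restaurants_250m': 3, 'nearest_indian_m': 900.0},
    {'name': 'C', 'restaurants_250m': 2, 'nearest_indian_m': 350.0},
    {'name': 'D', 'restaurants_250m': 2, 'nearest_indian_m': 650.0},
]

def is_promising(candidate, max_restaurants=2, min_indian_distance=400.0):
    """True if the area has few restaurants nearby and no Indian restaurant within the given distance."""
    few_restaurants = candidate['restaurants_250m'] <= max_restaurants
    # "Zero Indian restaurants within 400 m" means the nearest one is farther than 400 m
    no_indian_nearby = candidate['nearest_indian_m'] > min_indian_distance
    return few_restaurants and no_indian_nearby

promising = [c['name'] for c in candidates if is_promising(c)]
print(promising)
```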

Analysis

It is time to perform some basic exploratory data analysis and derive additional information to expand our venue location dataframe. We begin by counting the number of restaurants in every candidate area.

In [29]:
location_restaurants_count = [len(res) for res in location_restaurants]

df_loc['Restaurants in area'] = location_restaurants_count

print('Average number of restaurants in every area with radius=300m:', np.array(location_restaurants_count).mean())

df_loc.head(10)
Average number of restaurants in every area with radius=300m: 2.3846153846153846
Out[29]:
   Address                                          Latitude    Longitude              X             Y     Distance  Restaurants in area
0  259 N Capitol Ave UNIT 275, San Jose, CA 95127  37.368921  -121.843773  -3.389571e+06  1.486268e+07  5992.495307                    6
1  429 Giannotta Way, San Jose, CA 95133           37.371168  -121.848712  -3.388971e+06  1.486268e+07  5840.376700                    3
2  675 Ranson Dr, San Jose, CA 95133               37.373415  -121.853652  -3.388371e+06  1.486268e+07  5747.173218                    0
3  798 Ferndale Ct, San Jose, CA 95133             37.375662  -121.858592  -3.387771e+06  1.486268e+07  5715.767665                    0
4  969 Cadet Pl, San Jose, CA 95133                37.377909  -121.863533  -3.387171e+06  1.486268e+07  5747.173218                    0
5  1147 Ribisi Cir, San Jose, CA 95131             37.380156  -121.868474  -3.386571e+06  1.486268e+07  5840.376700                    0
6  1885 Seville Way, San Jose, CA 95131            37.382402  -121.873416  -3.385971e+06  1.486268e+07  5992.495307                    0
7  24 Alexander Ave, San Jose, CA 95116            37.362136  -121.838804  -3.390471e+06  1.486320e+07  5855.766389                    4
8  195 Gramercy Pl, San Jose, CA 95116             37.364383  -121.843742  -3.389871e+06  1.486320e+07  5604.462508                    0
9  2343 McKee Rd, San Jose, CA 95116               37.366630  -121.848681  -3.389271e+06  1.486320e+07  5408.326913                    8

We can now add a column giving the distance from each neighbourhood centroid to the nearest Indian restaurant. Note that when several Indian restaurants are nearby, we keep only the distance to the closest one.

In [30]:
distances_to_indian_restaurant = []

for area_x, area_y in zip(xs, ys):
    min_distance = 10000
    for res in indian_restaurants.values():
        res_x = res[7]
        res_y = res[8]
        d = calc_xy_distance(area_x, area_y, res_x, res_y)
        if d<min_distance:
            min_distance = d
    distances_to_indian_restaurant.append(min_distance)

df_loc['Distance to Indian restaurant'] = distances_to_indian_restaurant
In [31]:
df_loc.head()
Out[31]:
Address Latitude Longitude X Y Distance Restaurants in area Distance to Indian restaurant
0 259 N Capitol Ave UNIT 275, San Jose, CA 95127 37.368921 -121.843773 -3.389571e+06 1.486268e+07 5992.495307 6 145.317623
1 429 Giannotta Way, San Jose, CA 95133 37.371168 -121.848712 -3.388971e+06 1.486268e+07 5840.376700 3 454.838883
2 675 Ranson Dr, San Jose, CA 95133 37.373415 -121.853652 -3.388371e+06 1.486268e+07 5747.173218 0 1045.677968
3 798 Ferndale Ct, San Jose, CA 95133 37.375662 -121.858592 -3.387771e+06 1.486268e+07 5715.767665 0 1511.542210
4 969 Cadet Pl, San Jose, CA 95133 37.377909 -121.863533 -3.387171e+06 1.486268e+07 5747.173218 0 2048.432887
In [32]:
print('Average distance to closest Indian restaurant from each area center:', df_loc['Distance to Indian restaurant'].mean())
Average distance to closest Indian restaurant from each area center: 1537.8714638249814
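
The loop above starts from a sentinel of 10000, so an area farther than 10 km from every Indian restaurant would silently report exactly 10000. A minimal sketch of the same search without a sentinel follows. The function name `nearest_distance` and the sample coordinates are hypothetical, and `math.hypot` stands in for the notebook's `calc_xy_distance`, which is assumed here to be a planar Euclidean distance in meters.

```python
import math

def nearest_distance(x, y, points):
    """Return the smallest planar distance from (x, y) to any point in `points`.

    Returns math.inf when `points` is empty, so a missing neighbour is
    never mistaken for a real distance.
    """
    return min((math.hypot(px - x, py - y) for px, py in points), default=math.inf)

# Hypothetical projected coordinates in meters (not taken from the notebook).
indian_xy = [(300.0, 400.0), (1000.0, 0.0)]
print(nearest_distance(0.0, 0.0, indian_xy))  # 500.0
print(nearest_distance(0.0, 0.0, []))         # inf
```

Using `math.inf` as the fallback keeps later comparisons such as `distance >= 800` meaningful even when no Indian restaurant was found at all.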

Note that 1,500 m is slightly under a mile (about 1,609 m). Hence we can afford a fairly broad sweep, as Foursquare is selective when it comes to classifying a restaurant as "Indian".
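
As a quick check of the units, the mean distance printed above is in meters and can be converted to miles directly. This is only an arithmetic sketch; the variable names are illustrative.

```python
METERS_PER_MILE = 1609.344  # international mile, exact by definition

mean_distance_m = 1537.8714638249814  # value printed by the cell above
mean_distance_miles = mean_distance_m / METERS_PER_MILE
print(round(mean_distance_miles, 3))  # 0.956
```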

We can now create a heatmap, overlay the borders of San Jose neighbourhoods on the map, and draw circles indicating the distance from City Hall.

In [33]:
sj_district_url = 'https://opendata.arcgis.com/datasets/001373893c8347d4b36cf15a6103f78c_120.geojson'
sj_district = requests.get(sj_district_url).json()

def district_style(feature):
    return { 'color' : 'blue', 'fill' : False}
In [34]:
restaurant_latlons = [[res[2], res[3]] for res in restaurants.values()]

indian_latlons = [[res[2], res[3]] for res in indian_restaurants.values()]
In [35]:
from folium import plugins
from folium.plugins import HeatMap

map_sj = folium.Map(location=cityHall_sj, zoom_start=13)
folium.TileLayer('cartodbdark_matter').add_to(map_sj) #cartodbpositron cartodbdark_matter
HeatMap(restaurant_latlons).add_to(map_sj)
folium.Marker(cityHall_sj).add_to(map_sj)
folium.Circle(cityHall_sj, radius=1000, fill=False, color='white').add_to(map_sj)
folium.Circle(cityHall_sj, radius=2000, fill=False, color='white').add_to(map_sj)
folium.Circle(cityHall_sj, radius=3000, fill=False, color='white').add_to(map_sj)
folium.GeoJson(sj_district, style_function=district_style, name='geojson').add_to(map_sj)
map_sj

A preliminary look at the heatmap shows pockets of low restaurant density to the north and south-east of City Hall. We can create another heatmap showing Indian restaurants only.

In [36]:
map_sj = folium.Map(location = cityHall_sj, zoom_start=13)
folium.TileLayer('cartodbdark_matter').add_to(map_sj)
HeatMap(indian_latlons).add_to(map_sj)
folium.Marker(cityHall_sj).add_to(map_sj)
folium.Circle(cityHall_sj, radius=1000, fill=False, color='white').add_to(map_sj)
folium.Circle(cityHall_sj, radius=2000, fill=False, color='white').add_to(map_sj)
folium.Circle(cityHall_sj, radius=3000, fill=False, color='white').add_to(map_sj)
folium.GeoJson(sj_district, style_function=district_style, name='geojson').add_to(map_sj)
map_sj

This map is much more sparse due to the lower prevalence of Indian restaurants in San Jose. However, we see that South San Jose and East San Jose have pockets of low Indian restaurant density.

Towards the south-west we have Little Italy, and towards the south-east we have Little Saigon. We surmise that, although there are opportunities to open here, the character of these neighbourhoods would not be compatible with an Indian restaurant. On the other hand, towards the south, the neighbourhoods of Washington and Spartan Keyes are ideal location candidates.

Washington-Guadalupe and Spartan Keyes

Preliminary analysis of these neighbourhoods shows a large Chicano demographic. Washington is prominent as a historic district, and Spartan Keyes is a notable neighbourhood of artists, art studios, and galleries. Furthermore, Spartan Keyes is home to the south campus of San Jose State University and is also a historic neighbourhood.

"Many former warehouses and factories have been converted into art studios and galleries."

Popular with tourists, artists, and students, these neighbourhoods justify further analysis. We can define a new, narrower region of interest here and find the parts of these areas with few restaurants.

In [37]:
roi_x_min = cityHall_sj_x - 4000
roi_y_max = cityHall_sj_y + 4000
roi_width = 5000
roi_height = 5000
roi_center_x = roi_x_min + 2500
roi_center_y = roi_y_max - 2500
roi_center_lon, roi_center_lat = c2s(roi_center_x, roi_center_y)
roi_center = [roi_center_lat, roi_center_lon]

map_sj = folium.Map(location=roi_center, zoom_start=14)
HeatMap(restaurant_latlons).add_to(map_sj)
folium.Marker(cityHall_sj).add_to(map_sj)
folium.Circle(roi_center, radius=2500, color='white', fill=True, fill_opacity=0.4).add_to(map_sj)
folium.GeoJson(sj_district, style_function=district_style, name='geojson').add_to(map_sj)
map_sj

This region covers the pockets within Washington and Spartan Keyes that are close to City Hall. We can now create a new, dense grid of location candidates restricted to this region.

In [38]:
k = math.sqrt(3) / 2 # Vertical offset for hexagonal grid cells
x_step = 100
y_step = 100 * k 
roi_y_min = roi_center_y - 2500

roi_latitudes = []
roi_longitudes = []
roi_xs = []
roi_ys = []
for i in range(0, int(51/k)):
    y = roi_y_min + i * y_step
    x_offset = 50 if i%2==0 else 0
    for j in range(0, 51):
        x = roi_x_min + j * x_step + x_offset
        d = calc_xy_distance(roi_center_x, roi_center_y, x, y)
        if (d <= 2501):
            lon, lat = c2s(x, y)
            roi_latitudes.append(lat)
            roi_longitudes.append(lon)
            roi_xs.append(x)
            roi_ys.append(y)

print(len(roi_latitudes), 'candidate neighborhood centers generated.')
2261 candidate neighborhood centers generated.
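
The grid construction above can also be wrapped in a reusable function. The sketch below is a stand-alone variant that works on plain planar coordinates in meters; the function name `hex_grid_in_circle` and its parameters are hypothetical, and it does not call the notebook's `c2s` or `calc_xy_distance` helpers.

```python
import math

def hex_grid_in_circle(cx, cy, radius, spacing):
    """Return (x, y) points of a hexagonal grid lying within `radius` of (cx, cy)."""
    row_height = spacing * math.sqrt(3) / 2  # vertical distance between rows
    n_rows = int(2 * radius / row_height) + 1
    n_cols = int(2 * radius / spacing) + 1
    points = []
    for i in range(n_rows):
        y = cy - radius + i * row_height
        # Shift every other row by half a spacing to get the hexagonal layout.
        x_offset = spacing / 2 if i % 2 == 0 else 0.0
        for j in range(n_cols):
            x = cx - radius + j * spacing + x_offset
            if math.hypot(x - cx, y - cy) <= radius:
                points.append((x, y))
    return points

grid = hex_grid_in_circle(0.0, 0.0, radius=2500, spacing=100)
print(len(grid), 'candidate centers')
```

Because rows are `sqrt(3)/2` spacings apart and alternate rows are shifted by half a spacing, every interior point has six neighbours at exactly `spacing` meters, which gives more even coverage than a square grid.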

Now we can evaluate the candidate areas by counting the restaurants nearby and measuring the distance to the nearest Indian restaurant.

In [39]:
def count_restaurants_nearby(x, y, restaurants, radius=250):    
    count = 0
    for res in restaurants.values():
        res_x = res[7]; res_y = res[8]
        d = calc_xy_distance(x, y, res_x, res_y)
        if d<=radius:
            count += 1
    return count

def find_nearest_restaurant(x, y, restaurants):
    d_min = 100000
    for res in restaurants.values():
        res_x = res[7]; res_y = res[8]
        d = calc_xy_distance(x, y, res_x, res_y)
        if d<=d_min:
            d_min = d
    return d_min

roi_restaurant_counts = []
roi_indian_distances = []

print('Generating data on location candidates... ', end='')
for x, y in zip(roi_xs, roi_ys):
    count = count_restaurants_nearby(x, y, restaurants, radius=250)
    roi_restaurant_counts.append(count)
    distance = find_nearest_restaurant(x, y, indian_restaurants)
    roi_indian_distances.append(distance)
print('done.')
Generating data on location candidates... done.
In [40]:
# Let's put this into dataframe
df_roi_locations = pd.DataFrame({'Latitude':roi_latitudes,
                                 'Longitude':roi_longitudes,
                                 'X':roi_xs,
                                 'Y':roi_ys,
                                 'Restaurants nearby':roi_restaurant_counts,
                                 'Distance to Indian restaurant':roi_indian_distances})

df_roi_locations.head(10)
Out[40]:
Latitude Longitude X Y Restaurants nearby Distance to Indian restaurant
0 37.338869 -121.867944 -3.389321e+06 1.486740e+07 0 1049.138301
1 37.339243 -121.868767 -3.389221e+06 1.486740e+07 0 986.292173
2 37.336242 -121.863823 -3.389871e+06 1.486748e+07 0 1425.586946
3 37.336616 -121.864646 -3.389771e+06 1.486748e+07 0 1339.544406
4 37.336991 -121.865469 -3.389671e+06 1.486748e+07 0 1255.571699
5 37.337365 -121.866292 -3.389571e+06 1.486748e+07 0 1174.113013
6 37.337739 -121.867115 -3.389471e+06 1.486748e+07 0 1095.729183
7 37.338113 -121.867938 -3.389371e+06 1.486748e+07 0 1021.128551
8 37.338487 -121.868761 -3.389271e+06 1.486748e+07 0 951.201658
9 37.338861 -121.869585 -3.389171e+06 1.486748e+07 0 887.054491

The dataframe looks good. We can now filter the locations. We want to identify areas with no more than two restaurants within 250m and no Indian restaurants within 800m.

In [41]:
good_res_count = np.array((df_roi_locations['Restaurants nearby']<=2))
print('Locations with no more than two restaurants nearby:', good_res_count.sum())

good_ind_distance = np.array(df_roi_locations['Distance to Indian restaurant']>=800)
print('Locations with no Indian restaurants within 800m:', good_ind_distance.sum())

good_locations = np.logical_and(good_res_count, good_ind_distance)
print('Locations with both conditions met:', good_locations.sum())

df_good_locations = df_roi_locations[good_locations]
Locations with no more than two restaurants nearby: 1724
Locations with no Indian restaurants within 800m: 1737
Locations with both conditions met: 1482
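
The number of surviving candidates depends on the chosen thresholds. The sketch below shows one way to check that sensitivity on a handful of made-up rows; the function `count_passing` and the sample data are hypothetical and only mirror the two conditions used in the cell above.

```python
def count_passing(candidates, max_restaurants, min_indian_distance):
    """Count candidates with at most `max_restaurants` nearby and no Indian
    restaurant closer than `min_indian_distance` meters."""
    return sum(
        1
        for restaurants_nearby, indian_distance in candidates
        if restaurants_nearby <= max_restaurants
        and indian_distance >= min_indian_distance
    )

# Hypothetical rows: (restaurants nearby, distance to nearest Indian restaurant in m).
candidates = [(0, 1049.1), (1, 950.0), (2, 810.0), (3, 1200.0), (0, 400.0), (2, 790.0)]

for min_dist in (400, 800, 1000):
    print(min_dist, count_passing(candidates, max_restaurants=2, min_indian_distance=min_dist))
# 400 5
# 800 3
# 1000 1
```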

Time to plot this on a heatmap.

In [42]:
good_latitudes = df_good_locations['Latitude'].values
good_longitudes = df_good_locations['Longitude'].values

good_locations = [[lat, lon] for lat, lon in zip(good_latitudes, good_longitudes)]

map_sj = folium.Map(location=roi_center, zoom_start=14)
folium.TileLayer('cartodbpositron').add_to(map_sj)
HeatMap(restaurant_latlons).add_to(map_sj)
folium.Circle(roi_center, radius=2500, color='white', fill=True, fill_opacity=0.6).add_to(map_sj)
folium.Marker(cityHall_sj).add_to(map_sj)
for lat, lon in zip(good_latitudes, good_longitudes):
    folium.CircleMarker([lat, lon], radius=2, color='blue', fill=True, fill_color='blue', fill_opacity=1).add_to(map_sj) 
folium.GeoJson(sj_district, style_function=district_style, name='geojson').add_to(map_sj)
map_sj

Good. The areas in Washington-Guadalupe and Spartan Keyes suitable for development are identified. We know that there are no more than two restaurants nearby and no Indian restaurants within 800m. Any location here is good based on nearby competition.

We can indicate these locations on a heatmap:

In [43]:
map_sj = folium.Map(location=roi_center, zoom_start=14)
HeatMap(good_locations, radius=25).add_to(map_sj)
folium.Marker(cityHall_sj).add_to(map_sj)
for lat, lon in zip(good_latitudes, good_longitudes):
    folium.CircleMarker([lat, lon], radius=2, color='blue', fill=True, fill_color='blue', fill_opacity=1).add_to(map_sj)
folium.GeoJson(sj_district, style_function=district_style, name='geojson').add_to(map_sj)
map_sj

We now have a clear indication of the zones where there are few restaurants nearby and no Indian restaurants nearby. We can now use k-means clustering to find the relevant centroids and calculate addresses for the final result of our analysis.

In [44]:
from sklearn.cluster import KMeans

number_of_clusters = 10

good_xys = df_good_locations[['X', 'Y']].values
kmeans = KMeans(n_clusters=number_of_clusters, random_state=0).fit(good_xys)

cluster_centers = [c2s(cc[0], cc[1]) for cc in kmeans.cluster_centers_]

map_sj = folium.Map(location=roi_center, zoom_start=14)
folium.TileLayer('cartodbpositron').add_to(map_sj)
HeatMap(restaurant_latlons).add_to(map_sj)
folium.Circle(roi_center, radius=2500, color='white', fill=True, fill_opacity=0.4).add_to(map_sj)
folium.Marker(cityHall_sj).add_to(map_sj)
for lon, lat in cluster_centers:
    folium.Circle([lat, lon], radius=500, color='green', fill=True, fill_opacity=0.25).add_to(map_sj) 
for lat, lon in zip(good_latitudes, good_longitudes):
    folium.CircleMarker([lat, lon], radius=2, color='blue', fill=True, fill_color='blue', fill_opacity=1).add_to(map_sj)
folium.GeoJson(sj_district, style_function=district_style, name='geojson').add_to(map_sj)
map_sj

The clusters cover almost all of the candidate area reasonably well and do a decent job of indicating the zones with the most valid areas. The centroids are well placed in the appropriate zones.
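
For readers unfamiliar with what k-means does with the candidate points, the sketch below implements Lloyd's iteration (alternating an assignment step and a centroid update step) in plain Python. It is an illustration only and not scikit-learn's implementation; the function name `lloyd_kmeans`, the toy points, and the explicit starting centers are hypothetical.

```python
def lloyd_kmeans(points, centers, max_iter=100):
    """Run Lloyd's iteration from the given starting centers.

    Returns the final centers and the cluster index assigned to each point.
    """
    centers = list(centers)
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center.
        labels = [
            min(range(len(centers)),
                key=lambda k: (px - centers[k][0]) ** 2 + (py - centers[k][1]) ** 2)
            for px, py in points
        ]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for k, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centers.append((sum(p[0] for p in members) / len(members),
                                    sum(p[1] for p in members) / len(members)))
            else:
                new_centers.append(old)  # keep an empty cluster's center in place
        if new_centers == centers:
            break  # no center moved, so the assignment is stable
        centers = new_centers
    return centers, labels

# Two well-separated toy groups of four points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
centers, labels = lloyd_kmeans(points, centers=[(0, 0), (10, 10)])
print(centers)  # [(0.5, 0.5), (10.5, 10.5)]
```

The result of this kind of iteration depends on the starting centers, which is why the notebook fixes `random_state=0` to make the clustering reproducible.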

We can find the addresses of these cluster centers and treat them as approximate starting points from which to search each neighbourhood for the best specific location.

We can also observe the zones without a heatmap and use shading to indicate area.

In [45]:
map_sj = folium.Map(location=roi_center, zoom_start=14)
folium.Marker(cityHall_sj).add_to(map_sj)
for lat, lon in zip(good_latitudes, good_longitudes):
    folium.Circle([lat, lon], radius=250, color='#00000000', fill=True, fill_color='#0066ff', fill_opacity=0.07).add_to(map_sj)
for lat, lon in zip(good_latitudes, good_longitudes):
    folium.CircleMarker([lat, lon], radius=2, color='blue', fill=True, fill_color='blue', fill_opacity=1).add_to(map_sj)
for lon, lat in cluster_centers:
    folium.Circle([lat, lon], radius=500, color='green', fill=False).add_to(map_sj) 
folium.GeoJson(sj_district, style_function=district_style, name='geojson').add_to(map_sj)
map_sj

Now, we can reverse geocode these areas to get addresses which are suitable for development.

In [46]:
candidate_area_addresses = []
print('==============================================================')
print('Addresses of centers of areas recommended for further analysis')
print('==============================================================\n')
for lon, lat in cluster_centers:
    addr = get_address(google_api_key, lat, lon)
    candidate_area_addresses.append(addr)    
    x, y = s2c(lon, lat)
    d = calc_xy_distance(x, y, cityHall_sj_x, cityHall_sj_y)
    print('{}{} => {:.1f}km from City Hall'.format(addr, ' '*(50-len(addr)), d/1000))
    
==============================================================
Addresses of centers of areas recommended for further analysis
==============================================================

1300 Senter Rd, San Jose, CA 95112, USA            => 3.0km from City Hall
717 Locust St, San Jose, CA 95110, USA             => 2.1km from City Hall
1655 Little Orchard St, San Jose, CA 95125, USA    => 4.1km from City Hall
201 Delmas Ave, San Jose, CA 95110, USA            => 1.7km from City Hall
330 Phelan Ave, San Jose, CA 95112, USA            => 3.9km from City Hall
601 Bird Ave, San Jose, CA 95125, USA              => 2.7km from City Hall
1184 Lelong St, San Jose, CA 95112, USA            => 3.6km from City Hall
295 E Virginia St, San Jose, CA 95112, USA         => 1.7km from City Hall
182 Hollywood Ave, San Jose, CA 95112, USA         => 2.8km from City Hall
1186 Woodborough Dr, San Jose, CA 95116, USA       => 2.1km from City Hall

Our analysis is done. We generated 10 addresses indicating zones with few restaurants and no Indian restaurants nearby, all reasonably close to City Hall. These addresses should be considered a starting point for further research; they serve only as rough approximations of the zones in which to look for specific sites. These locations are interesting because of the presence of tourists and a large artist and student community, while being close to downtown San Jose.
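
Since proximity to downtown is the final criterion in the problem statement, the ten centers can also be ordered by their distance to City Hall. The sketch below uses the street addresses and rounded distances printed above; the variable names are illustrative.

```python
# (address, distance to City Hall in km) as printed in the output above.
candidates = [
    ('1300 Senter Rd', 3.0),
    ('717 Locust St', 2.1),
    ('1655 Little Orchard St', 4.1),
    ('201 Delmas Ave', 1.7),
    ('330 Phelan Ave', 3.9),
    ('601 Bird Ave', 2.7),
    ('1184 Lelong St', 3.6),
    ('295 E Virginia St', 1.7),
    ('182 Hollywood Ave', 2.8),
    ('1186 Woodborough Dr', 2.1),
]

# sorted() is stable, so addresses at the same distance keep their original order.
ranked = sorted(candidates, key=lambda item: item[1])
for addr, km in ranked[:3]:
    print('{:<25}{:.1f} km'.format(addr, km))
```

With the rounded distances, 201 Delmas Ave and 295 E Virginia St tie for closest at 1.7 km, followed by 717 Locust St at 2.1 km.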

In [47]:
map_sj = folium.Map(location=roi_center, zoom_start=14)
folium.Circle(cityHall_sj, radius=50, color='red', fill=True, fill_color='red', fill_opacity=1).add_to(map_sj)
for lonlat, addr in zip(cluster_centers, candidate_area_addresses):
    folium.Marker([lonlat[1], lonlat[0]], popup=addr).add_to(map_sj) 
for lat, lon in zip(good_latitudes, good_longitudes):
    folium.Circle([lat, lon], radius=250, color='#0000ff00', fill=True, fill_color='#0066ff', fill_opacity=0.05).add_to(map_sj)
map_sj

Results and Discussion

The analysis shows that, despite the large number of restaurants in San Jose, there are pockets of low density fairly close to City Hall. Most Indian restaurants were in the north of San Jose, so we focused most of our attention on the neighbourhood of Spartan Keyes, to the south of City Hall. With its community of artists, students, and tourists, along with the recent conversion of buildings from commercial to residential use, it is an ideal neighbourhood in which to start an Indian restaurant. Washington-Guadalupe was another interesting neighbourhood, although we focused most of our attention on Spartan Keyes.

Subsequently, we created a closely spaced grid of location candidates and filtered for those with few restaurants nearby and no Indian restaurants nearby. Then, after clustering these zones, we used reverse geocoding to find approximate addresses as starting points for more detailed local analysis based on other factors.

This results in 10 zones containing the greatest number of potential new restaurant locations based on the parameters we filtered for. Naturally, these are not all optimal locations, and there could easily be other reasons to rule them out. Furthermore, locations outside this area could also be excellent candidates (for example, close to the highway but far from City Hall). The techniques used in this project illustrate only one possible way of identifying desirable locations for a new venue.

Conclusion

The purpose of this project was to identify areas of San Jose close to City Hall with a low number of restaurants (particularly Indian restaurants), in order to help entrepreneurs, investors, speculators, and restaurateurs narrow down the search for an optimal location for a new Indian restaurant. By calculating the restaurant density distribution from Foursquare data, we identified general neighbourhoods that justify further analysis (Washington-Guadalupe and Spartan Keyes), then generated an extensive collection of locations satisfying basic requirements regarding nearby venues. Clustering these locations helped to identify the zones of greatest interest (those containing the most potential locations), and the addresses of the cluster centroids were reverse geocoded to serve as starting points for final exploration by interested parties.

Final decisions on the optimal location will be made by stakeholders based on the specific characteristics of neighbourhoods and locations in every recommended zone, taking into consideration additional factors such as the attractiveness of each location (proximity to public transportation, for example), noise levels, prices, and the social dynamics of each neighbourhood.